H. M. Kalayeh and D. A. Landgrebe*


ABSTRACT 


In this paper a criterion which measures the quality of the esti- 
mate of the covariance matrix of a multivariate normal distribution is 
developed. Based on this criterion, the necessary number of training 
samples is predicted. Experimental results which are used as a guide 
for determining the number of training samples are included. 
1. INTRODUCTION

In practice, the number of training samples is frequently limited 
because it is expensive to collect many training samples. A typical 
application in which this is the case is the field of remote sensing, 
and we will use this application to illustrate the technique. 

In remote sensing, the reflected and emitted electromagnetic energy 
of each pixel of a scene in several important wavelength bands is mea- 
sured by a multispectral remote sensor system mounted on board an air- 
craft or spacecraft. The output of the sensor system is used to form a 
point in a q-dimensional space [6]. A commonly used pattern classification
algorithm in this application is the maximum likelihood Gaussian
scheme. In this instance, the classes are each characterized as a Gaus- 
sian distribution in q-space and these distributions in turn are speci- 
fied by estimates of the means and covariances of each. However, we 
know that the performance of the estimators is dependent on the number 
of training samples. In the case of limited training samples, the esti- 
mates of the first and second order statistics cannot accurately depict 
all the information which is contained in the data. In particular, the 
estimate of the covariance matrix may be poor. As a result of this poor 
estimation, later analysis of the data (for example, classification 
accuracy and statistical distance measures) will be degraded. See [1]
for more details. Therefore, it is important to predict how many samples
will be needed in order that the performance of the estimators be
statistically reasonable. In the following, a criterion is developed to 
measure the performance of the estimate of the covariance matrix; then 
the number of required samples is predicted. 
2. PREDICTION CRITERION


Let $X_1, X_2, \ldots, X_N$ be q-dimensional random sample vectors which are
drawn from a normally distributed population with parameters $\theta = (M, \Sigma)$,
where $M$ is the true mean vector and $\Sigma$ the true covariance matrix. In
practice, $M$ and $\Sigma$ are not available, so they must be estimated from the
observed data. The maximum likelihood estimates of $M$ and $\Sigma$ are:


$$\hat{M} = \frac{1}{N} \sum_{i=1}^{N} X_i \qquad (1)$$

$$\hat{\Sigma} = \frac{1}{N} \sum_{i=1}^{N} (X_i - \hat{M})(X_i - \hat{M})^T \qquad (2)$$


For more detail, see [2]. 
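
The two estimates can be sketched in a few lines of plain Python. This is only an illustration: the function name `mean_and_covariance`, the `unbiased` flag, and the toy data are assumptions made here and do not come from the paper.

```python
def mean_and_covariance(samples, unbiased=False):
    """Estimate the mean vector and covariance matrix of a list of q-vectors.

    With unbiased=False the sum of outer products is divided by N, which is
    the maximum likelihood form of eq. (2); with unbiased=True it is divided
    by N - 1 instead.
    """
    n = len(samples)
    q = len(samples[0])
    mean = [sum(x[k] for x in samples) / n for k in range(q)]
    divisor = n - 1 if unbiased else n
    cov = [[sum((x[j] - mean[j]) * (x[k] - mean[k]) for x in samples) / divisor
            for k in range(q)]
           for j in range(q)]
    return mean, cov

# Hypothetical toy data: N = 4 samples with q = 2 features.
data = [[1.0, 2.0], [3.0, 6.0], [5.0, 4.0], [7.0, 8.0]]
m_hat, sigma_ml = mean_and_covariance(data)
_, sigma_unbiased = mean_and_covariance(data, unbiased=True)
```

For this toy set the two covariance estimates differ by the factor N/(N-1) = 4/3, which is noticeable only because N is so small.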

The performance of an estimator is measured by properties such as
whether it provides (a) an unbiased estimate, (b) a consistent estimate,
(c) an efficient estimate, and (d) a sufficient estimate. Now, let us
study the properties of the maximum likelihood estimates of $M$ and $\Sigma$. From
[2] we have:


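$$E[\hat{M}] = M \qquad (3)$$

$$\mathrm{Cov}[\hat{M}] = \frac{1}{N} \Sigma \qquad (4)$$

$$E[\hat{\Sigma}] = \frac{N-1}{N} \Sigma \qquad (5)$$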
Thus, by definition, $\hat{M}$ is an unbiased estimate of $M$, but $\hat{\Sigma}$ is not
an unbiased estimate of $\Sigma$. However, if

$$\hat{\Sigma} = \frac{1}{N-1} \sum_{i=1}^{N} (X_i - \hat{M})(X_i - \hat{M})^T$$

then $E[\hat{\Sigma}] = \Sigma$, which is unbiased. The density function of $\hat{M}$ is:

$$f(\hat{M}) = \frac{N^{q/2}}{(2\pi)^{q/2} |\Sigma|^{1/2}} \exp\{-\tfrac{1}{2} (\hat{M} - M)^T N \Sigma^{-1} (\hat{M} - M)\}$$

and the density function of $\hat{\Sigma}$ is of Wishart form. That is,
$\hat{M} \sim N(M, \frac{1}{N}\Sigma)$, a normal distribution, and
$\hat{\Sigma} \sim W(\Sigma, N)$, a Wishart distribution. For more details of
other properties of these estimators, see [2,3] and for various properties
of the Wishart distribution see


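Though the distribution of $\hat{\Sigma}$ is complex, the performance of the
estimates of the covariance matrix which are of interest can be measured
by the variance of the diagonal components of $\hat{\Sigma}$, as follows:

$$\hat{\sigma}_{kk} = \frac{1}{N-1} \sum_{i=1}^{N} (x_{ik} - \hat{m}_k)^2$$

where $x_{ik}$ and $\hat{m}_k$ denote the k-th components of $X_i$ and $\hat{M}$.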





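In [3] it is shown that $(N-1)\hat{\sigma}_{kk}/\sigma_{kk}$ has a chi-square
distribution with $(N-1)$ degrees of freedom. And

$$E[\hat{\sigma}_{kk}] = \sigma_{kk} \qquad (10)$$

$$E\left[\frac{\hat{\sigma}_{kk}}{\sigma_{kk}}\right] = 1 \qquad (11)$$

$$\mathrm{var}[\hat{\sigma}_{kk}] = \frac{2\sigma_{kk}^2}{N-1} \qquad (12)$$

$$\mathrm{var}\left[\frac{\hat{\sigma}_{kk}}{\sigma_{kk}}\right] = \frac{2}{N-1} \qquad (13)$$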
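
As a reading aid, the moments above can be traced back to two standard facts about a chi-square variable $Z$ with $\nu$ degrees of freedom, namely $E[Z] = \nu$ and $\mathrm{var}[Z] = 2\nu$. The sketch below assumes $Z = (N-1)\hat{\sigma}_{kk}/\sigma_{kk}$ and $\nu = N-1$; the symbols $Z$ and $\nu$ are introduced here only for illustration.

```latex
\begin{align*}
E\left[\frac{\hat{\sigma}_{kk}}{\sigma_{kk}}\right]
  &= \frac{E[Z]}{N-1} = \frac{N-1}{N-1} = 1, \\
\mathrm{var}\left[\frac{\hat{\sigma}_{kk}}{\sigma_{kk}}\right]
  &= \frac{\mathrm{var}[Z]}{(N-1)^2} = \frac{2(N-1)}{(N-1)^2} = \frac{2}{N-1}, \\
\mathrm{var}[\hat{\sigma}_{kk}]
  &= \sigma_{kk}^2 \, \mathrm{var}\left[\frac{\hat{\sigma}_{kk}}{\sigma_{kk}}\right]
   = \frac{2\sigma_{kk}^2}{N-1}.
\end{align*}
```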



Now let $Y = \Lambda^{-1/2} \Phi^T X$, where $\Phi$ and $\Lambda$ are the eigenvector
matrix and the eigenvalue matrix, respectively, of the covariance matrix
$\Sigma$, so that $\mathrm{Cov}(Y) = I$. In practice, $\hat{\Phi}$ and $\hat{\Lambda}$ are the
eigenvector matrix and the eigenvalue matrix of $\hat{\Sigma}$. Therefore, let
$\hat{I} = \Lambda^{-1/2} \Phi^T \hat{\Sigma} \Phi \Lambda^{-1/2}$ be the estimate of $\mathrm{Cov}(Y)$ and let
the diagonal elements of this matrix be $\hat{\gamma}_{kk}$. Because of the orthonormal
transformation, the features in the new space are independent; therefore,
$(N-1)\hat{\gamma}_{kk}$ has a chi-square distribution with $(N-1)$ degrees of
freedom. For brevity, let:



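$$(N-1)\hat{\gamma}_{kk} \sim \chi^2(N-1) \qquad (14)$$

and

$$\hat{Q} = [\hat{\gamma}_{11} + \cdots + \hat{\gamma}_{qq}] \qquad (15)$$

then

$$(N-1)\hat{Q} \sim \chi^2(q(N-1)) \qquad (16)$$

$$E[(N-1)\hat{Q}] = q(N-1) \qquad (17)$$

$$E[\hat{Q}] = q \qquad (18)$$

$$\mathrm{var}[(N-1)\hat{Q}] = 2q(N-1) \qquad (19)$$

$$\mathrm{var}[\hat{Q}] = \frac{2q}{N-1} \qquad (20)$$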


A logical choice for our prediction criterion is $\mathrm{var}(\hat{Q})$ because it
measures the dispersion of the estimate of the covariance matrix.
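
Equation (20) follows from (19) on dividing by $(N-1)^2$, and it can also be checked numerically. The sketch below is a Monte Carlo illustration rather than part of the paper's method. It assumes the data have already been whitened with the true parameters, so each of the q features is drawn independently from a unit-variance normal distribution. The function names, the sample size N = 21, the dimension q = 4, and the number of trials are arbitrary choices made here.

```python
import random

def sample_variance(values):
    """Unbiased sample variance (divides by N - 1)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

def q_hat(n, q, rng):
    """One draw of Q-hat: the sum of the q diagonal variance estimates."""
    # After whitening with the true parameters the q features are
    # independent with unit variance, so each one is simulated separately.
    return sum(sample_variance([rng.gauss(0.0, 1.0) for _ in range(n)])
               for _ in range(q))

def simulate_q_hat(n, q, trials=4000, seed=0):
    """Monte Carlo mean and variance of Q-hat over repeated training sets."""
    rng = random.Random(seed)
    draws = [q_hat(n, q, rng) for _ in range(trials)]
    mean = sum(draws) / trials
    return mean, sample_variance(draws)

n, q = 21, 4                      # hypothetical sample size and dimension
mean_q, var_q = simulate_q_hat(n, q)
predicted = 2 * q / (n - 1)       # eq. (20)
print(f"simulated var = {var_q:.3f}, predicted = {predicted:.3f}")
```

With a few thousand trials the simulated variance should land within a few percent of the predicted value; the exact figure depends on the seed.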


To see how to apply the criterion, suppose it is desired that
$\mathrm{var}(\hat{Q}) < a$. Therefore, from (20)

$$N > 1 + \frac{2q}{a} \qquad (21)$$


Note that the minimum value of N is q + 1, because if N is less than q +
1, then the covariance matrix will be singular. So,

$$\mathrm{var}(\hat{Q})_{\max} = \frac{2q}{N_{\min} - 1} = 2 \qquad (22)$$

A plot of $\mathrm{var}(\hat{Q})$ as a function of N with q as a parameter is shown
in Figure 1. Now, if for example a = 0.2, then N > 1 + 10q.
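
The arithmetic in (20) to (22) is simple enough to wrap in two helper functions. The following is a small sketch; the names `var_q_hat` and `sample_bound` and the choice q = 4 are assumptions made here for illustration.

```python
def var_q_hat(q, n):
    """Eq. (20): variance of Q-hat for q features and n training samples."""
    if n < q + 1:
        raise ValueError("need at least q + 1 samples for a nonsingular covariance")
    return 2.0 * q / (n - 1)

def sample_bound(q, a):
    """Eq. (21): N must exceed this value so that var(Q-hat) < a."""
    return 1.0 + 2.0 * q / a

q = 4                              # hypothetical number of spectral bands
bound = sample_bound(q, 0.2)       # 1 + 10q for a = 0.2
worst = var_q_hat(q, q + 1)        # eq. (22): the maximum, reached at N = q + 1
at_bound = var_q_hat(q, 1 + 10 * q)
```

Note that the worst-case value does not depend on q, since the factor q cancels when N = q + 1.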

The next question to be addressed is how one chooses a reasonable
value for a. To answer this question, let us consider the following.
As shown in Figure 1, if N > 1 + 10q, then $\mathrm{var}(\hat{Q})$ is decreasing
very slowly and its slope is small, with magnitude less than 0.02/q. This
suggests that if N = 1 + 10q, then the statistical distance between the true
probability density and the estimated one may be close to zero. The
transformed divergence [5,6] is a useful statistical distance measure and
is given by


$$D_T = 2000 [1 - \exp(-D/8)] \qquad (23)$$
Figure 1. Variance of Q as a function of number of training samples N.


where

$$D = \tfrac{1}{2} \mathrm{tr}[(\Sigma - \hat{\Sigma})(\hat{\Sigma}^{-1} - \Sigma^{-1})] + \tfrac{1}{2} \mathrm{tr}[(\Sigma^{-1} + \hat{\Sigma}^{-1})(M - \hat{M})(M - \hat{M})^T] \qquad (24)$$
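
To get a feel for the scale of $D_T$, a worked evaluation of (23) may help. The value $D = 2.30$ is taken from the row of Table 1 with $N = 1 + 5q$; the intermediate rounding is added here.

```latex
\begin{align*}
D_T &= 2000\,[1 - \exp(-2.30/8)] = 2000\,[1 - \exp(-0.2875)] \\
    &\approx 2000 \times 0.2499 \approx 500 .
\end{align*}
```

Because $\exp(-D/8)$ tends to zero as $D$ grows, $D_T$ saturates at 2000, which agrees with the maximum value $(D_T)_{\max} = 2000$ used in the discussion of Figure 2.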

We will use $D_T$ to experimentally measure the quality of the estimates of
the parameters and also as a guide to choosing a or N. The following
procedure provides a practical means for doing so:

1. Assume that the true probability density of the data is normal
   with mean vector $M$ and covariance matrix $\Sigma$.

2. Based on the true parameters of the distribution, data
   points are randomly generated.

3. The parameters of the distribution are estimated based on the
   randomly generated samples and then, using transformed
   divergence, the statistical distance between the true probability
   density and the estimated one is computed.

4. Step 3 is repeated five times and the average transformed
   divergence is calculated.

5. The average transformed divergence for different values of
   $\mathrm{var}(\hat{Q})$ is computed and shown in Figure 2.
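
The steps above can be sketched with the Python standard library alone. Several choices in this sketch are assumptions made here, not details given in the paper: the true density is taken to be the standard normal (zero mean, identity covariance), which loses no generality because the divergence (24) is unchanged by an invertible affine change of variables; the covariance is estimated with the N - 1 divisor; and a fresh sample set is drawn on each repeat. All function names are illustrative. The sketch is not expected to reproduce the numbers in Table 1.

```python
import math
import random

def sample_mean_cov(xs):
    """Sample mean and (N-1)-normalized covariance of a list of q-vectors."""
    n, q = len(xs), len(xs[0])
    mean = [sum(x[k] for x in xs) / n for k in range(q)]
    cov = [[sum((x[j] - mean[j]) * (x[k] - mean[k]) for x in xs) / (n - 1)
            for k in range(q)] for j in range(q)]
    return mean, cov

def inverse(a):
    """Gauss-Jordan inverse with partial pivoting (small matrices only)."""
    q = len(a)
    aug = [list(row) + [float(i == j) for j in range(q)]
           for i, row in enumerate(a)]
    for col in range(q):
        pivot = max(range(col, q), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(q):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[q:] for row in aug]

def divergence_from_standard_normal(mean, cov):
    """Eq. (24) with true mean 0 and true covariance I."""
    q = len(mean)
    inv = inverse(cov)
    # (I - S)(S^-1 - I) = S^-1 + S - 2I, so only the traces are needed.
    cov_term = 0.5 * (sum(inv[k][k] + cov[k][k] for k in range(q)) - 2 * q)
    quad = sum(mean[j] * inv[j][k] * mean[k]
               for j in range(q) for k in range(q))
    mean_term = 0.5 * (sum(m * m for m in mean) + quad)
    return cov_term + mean_term

def transformed_divergence(d):
    """Eq. (23)."""
    return 2000.0 * (1.0 - math.exp(-d / 8.0))

def average_dt(q, n, repeats=5, rng=None):
    """Steps 2 to 4: average D_T over `repeats` simulated training sets."""
    rng = rng or random.Random(0)
    total = 0.0
    for _ in range(repeats):
        xs = [[rng.gauss(0.0, 1.0) for _ in range(q)] for _ in range(n)]
        mean, cov = sample_mean_cov(xs)
        total += transformed_divergence(
            divergence_from_standard_normal(mean, cov))
    return total / repeats

# Step 5 for a hypothetical q = 4 and N = 1 + 2q, 1 + 5q, 1 + 10q.
for multiple in (2, 5, 10):
    n = 1 + multiple * 4
    print(n, round(average_dt(q=4, n=n), 1))
```

With only five repeats the averages are noisy, so a larger `repeats` value gives a smoother curve when plotting against $2q/(N-1)$.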

The result in Figure 2 shows an almost linear relationship between $D_T$
and $\mathrm{var}(\hat{Q})$. This implies that when
$\mathrm{var}(\hat{Q}) = \mathrm{var}(\hat{Q})_{\max} = 2$, then
$D_T = (D_T)_{\max} = 2000$. This indicates that the quality of the estimates of
the parameters is very poor. However, if $\mathrm{var}(\hat{Q}) = 0.2$, then $D_T = 175$,
which suggests that the estimated probability density is very close to
the true one. In practice, however, the true parameters of the distribution
are not available and neither is the transformed divergence. As
mentioned earlier, a logical choice for our prediction criterion is
$\mathrm{var}(\hat{Q})$ because it measures the dispersion of the estimate.

Figure 2. The average transformed divergence as a function of variance of Q.

We have found that $D_T = 500$, or equivalently, a = 0.4 is a logical
threshold to decide whether the estimates of the parameters are good or
not. This choice implies that the number of training samples should not
be less than 1 + 5q. However, we believe that by using the information given in
Table 1, one should be able to establish an upper bound on $\mathrm{var}(\hat{Q})$ and
consequently estimate the required number of training samples.

3. CONCLUSION 

The main purpose of this paper was to develop a criterion to measure
the dispersion of the estimate of the covariance matrix of a multivariate
normal distribution and, based on this criterion, to be able to
predict the necessary number of training samples. To accomplish this,
the variance of $\hat{Q} = \mathrm{tr}(\hat{I})$, with
$\hat{I} = \Lambda^{-1/2} \Phi^T \hat{\Sigma} \Phi \Lambda^{-1/2}$, was chosen as the
predictor criterion. It was theoretically shown that the variance of $\hat{Q}$
is equal to $2q/(N-1)$ with maximum value of 2. Also, the divergence between
the true distribution and the estimated one for different values of the
variance of $\hat{Q}$ was experimentally computed and used to establish an
upper bound on the variance of $\hat{Q}$. It was suggested that the required
training samples should be about five times the number of features.
Table 1. Distance between the true distribution and estimated
one as a function of var(Q) or number of training samples.

$\mathrm{var}(\hat{Q})$    $D_T$    $D$     $N$
1.00                       1250     7.65    1 + 2q
0.50                       675      3.40    1 + 4q
0.40                       500      2.30    1 + 5q
0.25                       210      0.80    1 + 8q
0.20                       175      0.70    1 + 10q